Classification and reconstruction of spatially overlapping phase images using diffractive optical networks

Diffractive optical networks unify wave optics and deep learning to all-optically compute a given machine learning or computational imaging task as the light propagates from the input to the output plane. Here, we report the design of diffractive optical networks for the classification and reconstruction of spatially overlapping, phase-encoded objects. When two different phase-only objects spatially overlap, the individual object functions are perturbed since their phase patterns are summed up. The retrieval of the underlying phase images from solely the overlapping phase distribution presents a challenging problem, the solution of which is generally not unique. We show that through a task-specific training process, passive diffractive optical networks composed of successive transmissive layers can all-optically and simultaneously classify two different randomly-selected, spatially overlapping phase images at the input. After trained with ~ 550 million unique combinations of phase-encoded handwritten digits from the MNIST dataset, our blind testing results reveal that the diffractive optical network achieves an accuracy of > 85.8% for all-optical classification of two overlapping phase images of new handwritten digits. In addition to all-optical classification of overlapping phase objects, we also demonstrate the reconstruction of these phase images based on a shallow electronic neural network that uses the highly compressed output of the diffractive optical network as its input (with e.g., ~ 20–65 times less number of pixels) to rapidly reconstruct both of the phase images, despite their spatial overlap and related phase ambiguity. The presented phase image classification and reconstruction framework might find applications in e.g., computational imaging, microscopy and quantitative phase imaging fields.

. Schematic of a diffractive optical network that can all-optically classify overlapping phase objects despite phase ambiguity at the input; this diffractive optical network also compresses the input spatial information at its output plane for simultaneous reconstruction of the individual phase images of the overlapping input objects using a back-end electronic neural network. (a) Optical layout of the presented 5-layer diffractive optical networks that can all-optically classify overlapping phase objects, e.g., phase-encoded handwritten digits, despite the phase ambiguity at the input plane due to spatial overlap. The diffractive optical network processes the incoming object waves created by the spatially overlapping, phase-encoded digits e.g., '6' and '7' , to correctly reveal the classes of both input objects (green). A separately trained shallow electronic neural network (with 2 hidden layers) rapidly reconstructs the individual phase images of both input objects using the optical signals detected at the output plane of the diffractive optical network. (b-d) Different detector configurations and class encoding schemes at the output plane of a diffractive optical network, devised to represent all the possible data class combinations at the input field-of-view created by overlapping phase objects.

Scientific Reports
| (2022) 12:8446 | https://doi.org/10.1038/s41598-022-12020-y www.nature.com/scientificreports/ phase images containing spatially overlapping MNIST digits (from the same class as well as different classes); blind testing of one of the resulting diffractive optical networks using 10,000 test images of overlapping phase objects revealed a classification accuracy of > 85.8%, optically matching the correct labels of both phase objects that were spatially overlapping within the input field-of-view.
In addition to all-optical classification of overlapping phase images using a diffractive optical network, we also combined our diffractive optical network models with separately trained, electronic image reconstruction networks to recover the individual phase images of the spatially overlapping input objects solely based on the optical class signals collected at the output of the corresponding diffractive optical network. We quantified the success of these digitally reconstructed phase images using the structural similarity index measure (SSIM) and the peak-signal-to-noise-ratio (PSNR) to reveal that a shallow electronic neural network with 2 hidden layers can simultaneously reconstruct both of the phase objects that are spatially overlapping at the input plane despite the fact that the number of detectors/pixels at the output plane of the diffractive optical network is e.g., ~ 20-65 times smaller compared to an ideal diffraction-limited imaging system. This means the diffractive optical network encoded the spatial features of the overlapping phase objects into a much smaller number of pixels at its output plane, which was successfully decoded by the shallow electronic network to simultaneously perform two tasks: (1) image reconstruction of overlapping spatial features at the input field-of-view, and (2) image decompression.
We believe that the presented diffractive optical network training and design techniques for computational imaging of phase objects will enable memory-efficient, low-power and high frame-rate alternatives to existing phase imaging platforms that often rely on high-pixel count sensor arrays, and therefore might find applications in e.g. microscopy and quantitative phase imaging fields.

Results
Spatial overlap between phase objects within the input field-of-view of an optical imaging system obscures the information of samples due to the superposition of the individual phase channels, leading to loss of structural information. For thin phase-only objects (such as e.g., cultured cells or thin tissue sections), when two samples e jθ 1( x,y) and e jθ 2( x,y) overlap with each other in space, the resulting object function can be expressed as e j(θ 1( x,y)+θ 2( x,y)) , and therefore a coherent optical imaging system does not have direct access to θ 1 x, y or θ 2 x, y , except their summation (see Fig. 1a). In the context of diffractive optical networks and all-optical image classification tasks, another challenging aspect of dealing with spatially overlapping phase objects is that the effective number of data classes represented by different input images significantly increases compared to a singleobject classification task. Specifically, for a target dataset with M data classes represented through the phase channel of the input, the total number of data classes at the input (with two overlapping phase objects) becomes To mitigate this challenge, in this work we introduced different class encoding schemes that better handle the all-optical classification of these large number of possible class combinations at the input. The output detector layout, D-1, shown in Fig. 1b illustrates one alternative design strategy where the problem of classification of overlapping phase objects is solved by using only 2M individual detectors with a significant reduction in the number of output detectors when compared to M(M−1) 2 + M . The use of 2M single-pixel detectors at the output plane (see Fig. 1b), can handle all the combinations and classify the overlapping input phase objects even if they belong to the same data class or not. To achieve this, we have two different sets of detectors, D i m , m = 0, 1, 2, . . . , M − 1 and D ii m , m = 0, 1, 2, . . . , M − 1 , which represent the classes of the individual overlapping phase images. The final class assignments in this scheme are given based on the largest two optical signals among all the 2M detectors, where the assigned indices (m) of the corresponding two winner detectors indicate the all-optical classification results for the overlapping phase images. This is a simple class decision rule with a look up table of detector-class assignments (as shown Fig. 1b), where the strongest two detector signals indicate the inferred classes based on their m . Stated mathematically, the all-optical estimation of the classes, ĉ = c 1 , c 2 , of the overlapping phase images is given by, where I denotes the optical signals detected by 2M individual detectors, i.e., [D i m , D ii m ]. With the mod( * ) operation in Eq. (1), it can be observed that when the ground truth object classes, c 1 and c 2 , are identical, a correct optical inference would result in c 1 = c 2 . On the other hand, when c 1 = c 2 , there are four different detector combinations for the two largest optical signals that would result in the same ( c 1 , c 2 ) pair according to our class decision rule. For example, in the case of the input transmittance shown in Fig. 1a, which is comprised of handwritten digits '6' and '7' , the output object classes based on our decision rule would be the same if the two largest optical signals collected by the detectors correspond to: (1) D i 6 and D ii 7 , (2) D i 7 and D ii 6 , (3) D i 6 and D i 7 or (4) D ii 6 and D ii 7 ; all of these four combinations of winner detectors at the output plane would reveal the correct classes for the input phase objects in this example (digits '6' and '7').
Therefore, the training the diffractive optical networks according to this class decision rule requires subtle but vital changes in the ground truth labels representing the inputs and the loss function driving the evolution of the diffractive layers compared to a single-object classification system. If we denote the one-hot vector labels representing the classes of the input objects in a single-object classification system as, g 1 and g 2 , with an entry www.nature.com/scientificreports/ of 1 at their c th 1 and c th 2 entries, respectively, for the case of spatially overlapping two phase objects at the input field-of-view we can define new ground truth label vectors of length 2M using g 1 and g 2 . For the simplest case of c 1 = c 2 (i.e., g 1 = g 2 ), the 2M-vector g e is constructed as g e = 0.5 × g 1 , g 2 . The constant multiplicative factor of 0.5 ensures that the resulting vector g e defines a discrete probability density function satisfying 2M 1 g e m = 1 . It is important to note that since c 1 = c 2 , we have g 1 , g 2 = g 1 , g 2 . On the other hand, when the overlapping input phase objects are from different data classes i.e., c 1 = c 2 , we define four different label vectors g a , g b , g c , g d representing all the four combinations. Among this set of label vectors, we set g a = 0.5 × g 1 , g 2 and g b = 0.5 × g 2 , g 1 . The label vectors g c and g d depict the cases, where the output detectors corresponding to the input object classes lie within D i m and D ii m , respectively. In other words, the c th 1 and c th 2 entries of g c are equal to 0.5, and similarly the (M + c 1 ) th and (M + c 2 ) th entries of g d are equal to 0.5, while all the rest of the entries are equal to zero.
Based on these definitions, the training loss function ( L ) of the associated forward model was selected to reflect all the possible input combinations at the sample field-of-view (input), therefore, it was defined as, and L e c denote the penalty terms computed with respect to the ground truth label vectors g a , g b , g c , g d , and g e , respectively, and sgn(.) is the signum function. The classification errors, L x c , are computed using the cross-entropy loss 26 where x refers to one of a, b, c, d, or e, I m denotes the normalized intensity collected by a given detector at the output plane (see the Methods section for further details). The term g x m in Eq. (3) denotes the mth entry of the ground truth data class vector, g x .
Based on this diffractive optical network design scheme and the output detector layout D-1, we trained a 5-layer diffractive optical network (Fig. 1a,b) using the loss function depicted in Eq. (2) over ~ 550 million input training images containing various combinations of spatially overlapping, phase-encoded MNIST handwritten digits. Following the training phase, the resulting diffractive layers of this network, which we term as D 2 NN-D1, are illustrated in Fig. 2a. To quantify the generalization performance of D 2 NN-D1 for the classification of overlapping phase objects that were never seen by the network before, we created a test dataset, T 2 , with 10 K phase images, where each image contains two spatially-overlapping phase-encoded test digits randomly selected from the standard MNIST test set, T 1 . In this blind testing phase, D 2 NN-D1 achieved 82.70% accuracy on T 2 , meaning that in 8,270 cases out of 10,000 test inputs, the class estimates c 1 , c 2 at the diffractive optical network's output plane were correct for both of the spatially overlapping handwritten digits. For the remaining 1730 test images, the classification decision of the diffractive optical network is incorrect for at least one of the phase objects within the field-of-view. Figure 2b-e depict some of the correctly classified phase image examples from the test dataset T 2 with phase encoded handwritten digits, along with the resulting class scores at the output detectors of the diffractive optical network.
This blind inference accuracy of the diffractive optical network shown in Fig. 2a, i.e., D 2 NN-D1, can be further improved by combining the above outlined training strategy with a differential detection scheme, where each output detector in D1 (Fig. 1b) is replaced with a differential pair of detectors (i.e., a total of 2 × 2 M detectors are located at the output plane, see Fig. 1c). The differential signal between a pair of detectors shown in Fig. 1c encodes a total of 2xM differential optical signals and similar to the previous approach of D1, the final class assignments in this scheme are given based on the two largest signals among all the differential optical signals. With the incorporation of this differential detection scheme, the vector I in Eq. (1) is replaced with the differential signal 5 , I = I + − I − , where I + and I − denote the optical signals collected by the 2M detector pairs, virtually representing the positive and negative parts, respectively.
Using this differential diffractive optical network design, which we termed as D 2 NN-D1d (see Fig. 1c), we achieved a blind testing accuracy of 85.82% on the test dataset T 2 . The diffractive layers comprising the D 2 NN-D1d network are shown in Fig. 3a, which were trained using ~ 550 million input phase images of spatially overlapping MNIST handwritten digits, similar to D 2 NN-D1. Compared to the classification accuracy attained by D 2 NN-D1, the inference accuracy of its differential counterpart, D 2 NN-D1d, is improved by > 3.1% at the expense of using 2M additional detectors at the output plane of the optical network. Figure 3b-e illustrate some examples of the correctly classified phase images from the test dataset T 2 with phase encoded handwritten digits, along with the resulting differential class scores at the output detectors of the diffractive optical network.
The blind inference accuracies achieved by D 2 NN-D1 and D 2 NN-D1d (82.70% and 85.82%, respectively) on the test dataset T 2 , demonstrate the success of the underlying detector layout designs and the associated training strategy for solving the phase ambiguity problem to all-optically classify overlapping phase images using diffractive optical networks. When these two diffractive optical networks (D 2 NN-D1 and D 2 NN-D1d) are blindly tested over T 1 that provides input images containing a single phase-encoded handwritten digit (without the second overlapping phase object), they attain better classification accuracies of 90.59% and 93.30%, respectively (see the Methods section). As a reference point, a 5-layer diffractive optical network design with an identical layout to the one shown in Fig. 1a, can achieve a blind classification accuracy of ~ 98% 5,8 on test set T 1 , provided that it is trained to classify only one phase-encoded handwritten digit per input image (without any spatial  (2) and (3), guided the evolution of the corresponding diffractive layers to recognize the spatial features  www.nature.com/scientificreports/ created by the overlapping handwritten digits, as opposed to focusing solely on the actual features describing the individual handwritten digits.
To further reduce the required number of optical detectors at the output plane of a diffractive optical network, we considered an alternative design (D-2) shown in Fig. 1d. In this alternative design scheme D-2, there are two extra detectors D + Q , D − Q (shown with blue in Fig. 1d), in addition to M class detectors {D m , m = 1, 2, . . . , M} (shown with gray in Fig. 1d). The sole function of the additional pair of detectors D + Q , D − Q is to decide whether the spatially-overlapping input phase objects belong to the same or different data classes. If the difference signal of this differential detector pair (Fig. 1d) is non-negative (i.e., I D + , the diffractive optical network will infer that the overlapping input objects are from the same data class, hence there is only one class assignment to be made by simply determining the maximum signal at the output class detectors: Refer to the Methods section for further details on the training of diffractive optical networks that employ D-2 ( Fig. 1d).
Similar to earlier diffractive optical network designs, we used ~ 550 million input phase images of spatially overlapping MNIST handwritten digits to train 5 diffractive layers constituting the D 2 NN-D2 network (see Fig. 4a). Figure 4b-d illustrate sample input phase images that contain objects from different data classes, along with the output detector signals that correctly predict the classes/digits of these overlapping phase objects; notice that in each one of these cases, we have at the output plane I D + indicating the success of the network's inference. As another example of blind testing, Fig. 4e reports the diffractive optical network's inference for two input phase objects that are from the same data class, i.e., digit '3' . At the network's output, this time we have , correctly predicting that the two overlapping phase images are of the same class; the maximum output signal of the remaining output detectors {D m , m = 1, 2, . . . , M} also correctly reveals that the handwritten phase images belong to digit '3' with a maximum signal at D 3 . This D 2 NN-D2 design provides 82.61% inference accuracy on the test set T 2 with 10 K test images, closely matching the inference performance of D 2 NN-D1 (82.70%) reported in Fig. 2. In fact, an advantage of this D 2 NN-D2 design lies in its inference performance and blind testing accuracy on test set T 1 , achieving 93.38% for classification of input phase images of single digits (without any spatial overlap at the input field-of-view).
We also implemented the differential counterpart of the detector layout D-2, which we term as D-2d (see Fig. 1e), where the M class detectors in D-2 are replaced with M differential pairs of output detectors. In this configuration D-2d, the total number of detectors at the output plane of the diffractive optical network becomes 2M + 2 and the all-optical inference rules remain the same as in D-2: for , the class inference is made by simply determining the maximum differential signal at the output class detectors, and for the case of I D + Q < I D − Q the inference of the classes of input phase images is determined based on the largest two differential optical signals at the network output. Figure 5a shows the diffractive layers of the resulting D 2 NN-D2d that is trained based on the detector layout, D-2d (Fig. 1e) using the same training dataset as before: ~ 550 million input phase images of spatially overlapping, phase-encoded MNIST handwritten digits. This new differential diffractive optical network design, D 2 NN-D2d, provides significantly higher blind inference accuracies compared to its non-differential counterpart D 2 NN-D2, achieving 85.22% and 94.20% on T 2 and T 1 datasets, respectively. Figure 5b-d demonstrate some examples of the input phase images from test set T 2 that are correctly classified by D 2 NN-D2d along with the corresponding optical signals collected by the output detectors representing the positive and negative parts, I M+ and I M− , of the associated differential class signals, I M = I M+ − I M− . As another example, the input phase image depicted in Fig. 5e has two overlapping phase-encoded digits from the same data class, handwritten digit '4' , and the diffractive optical network correctly outputs I D + Q > I D − Q with the maximum differential class score strongly revealing an optical inference of digit '4' . Table 1 summarizes the optical blind classification accuracies achieved by different diffractive optical network designs, D 2 NN-D1, D 2 NN-D1d, D 2 NN-D2 and D 2 NN-D2d on test image sets T 2 and T 1 . Even though D 2 NN-D1d achieves the highest inference accuracy for the classification of spatially overlapping phase objects, D 2 NN-D2d offers a balanced optical inference system achieving very good accuracy on both T 1 and T 2 . These two differential diffractive optical network models outperform their non-differential counterparts with superior inference performance on both T 2 and T 1 . The confusion matrices demonstrating the class-specific inference performances of the presented diffractive optical networks, D 2 NN-D1, D 2 NN-D1d, D 2 NN-D2 and D 2 NN-D2d, are also reported in Supplementary Figs. 1-4, respectively.
Next, we aimed to reconstruct the individual images of the overlapping phase objects (handwritten digits) using the detector signals at the output of a diffractive optical network; stated differently our goal here is to resolve the phase ambiguity at the input plane and reconstruct both of the input phase images, despite their spatial overlap and the loss of phase information. For this aim, we combined each one of our diffractive optical networks, D 2 NN-D1, D 2 NN-D1d, D 2 NN-D2 and D 2 NN-D2d, one by one, with a shallow, fully-connected (FC) electronic network with two hidden layers, forming a task-specific imaging system as shown in Fig. 6. In these hybrid machine vision systems, the optical signals synthesized by a given diffractive optical network (frontend encoder) are interpreted as encoded representations of the spatial information content at the input plane. Accordingly, the electronic back-end neural network is trained to process the encoded optical signals collected by the output detectors of the diffractive optical network to decode and reconstruct the individual phase images describing each object function at the input plane, resolving the phase ambiguity due to the spatial overlap of the two phase objects. Figure 6a- Fig. 6, the electronic image reconstruction networks only have 2 hidden layers with 100 and 400 neurons, and the final output layers of these networks have 28 × 28 × 2 neurons, revealing the images of the individual phase objects, resolving the phase ambiguity due to the spatial overlap of the input phase images. The quality of these image reconstructions is quantified using (1) the structural similarity index measure (SSIM) and (2) the peak signal-to-noise ratio (PSNR). Table 2 shows the mean SSIM and PSNR values achieved by these hybrid machine vision systems along with the corresponding standard deviations computed over the entire 10 K test images (T 2 ). For these presented image reconstructions, we should emphasize that the dimensionality reduction (i.e., the image data compression) between the input and output planes of the diffractive optical networks (D 2 NN-D1, D 2 NN-D1d, D 2 NN-D2 and D 2 NN-D2d) is 39.2 × , 19.6 × , 65.33 × and 35.63 × , respectively, meaning that the spatial information of the overlapping phase images at the input field-of-view is significantly compressed (in terms of the number of pixels) at the output plane of the diffractive optical network. This large compression sets another significant challenge for the image reconstruction task in addition to the phase ambiguity and spatial overlap of the target images. With these large compression ratios, the presented diffractive optical network-based machine vision systems managed to faithfully recover the phase images of each input object despite their spatial overlap and phase information loss, demonstrating the coherent processing power of diffractive optical networks as well as the unique design opportunities enabled by their collaboration with electronic neural networks that form task-specific back-end processors.

Discussion
The optical classification of overlapping phase images using diffractive optical networks presents a challenging problem due to the spatial overlap of the input images and the associated loss of phase information at the input plane. Interestingly, different combinations of handwritten digits at the input present different amounts of spatial overlap, which is a function of the class of the selected input digits as well as the style of the handwriting of the person. To shed more light on this, we quantified the all-optical blind inference accuracies of the presented diffractive optical networks as a function of the spatial overlap percentage, ξ , at the input field-of-view; see Fig. 7.
In the first group of examples shown in Fig. 7a, the input fields-of-view contain digits from different data classes ( c 1 = c 2 ) and in the second group of examples shown in Fig. 7b, the spatially overlapping objects are from the same data class, c 1 = c 2 . The input phase images in T 2 exhibit spatial overlap percentages varying between ~ 20% and ~ 100%. Figure 7c,d illustrate the change in the optical blind inference accuracy of the diffractive optical network, D 2 NN-D1, as a function of the spatial overlap percentage, ξ , for the first ( c 1 = c 2 ) and the second ( c 1 = c 2 ) group of test input images, respectively. When c 1 = c 2 as in Fig. 7c, the optical inference accuracy is hindered by the increasing amount of spatial overlap between the two input phase objects, as in this case, the spatial features of the effective input transmittance function significantly deviate from the features defining the individual data classes. In the other case shown in Fig. 7d, i.e., c 1 = c 2 , the relationship between the spatial overlap ratio ξ and the blind inference accuracy is reversed, since, with c 1 = c 2 , increasing ξ means that the effective phase distribution at the input plane resembles more closely to a single object/digit. The same behavior can also be observed for the other diffractive optical networks, D 2 NN-D1d, D 2 NN-D2 and D 2 NN-D2d, reported in Fig. 7e-j, respectively. Although our forward model during the training assumes that the input field-of-view contains two overlapping phase objects, we further tested the behavior of our diffractive models, D 2 NN-D1d and D 2 NN-D2d, for a hypothetical case, where more than two objects are present simultaneously at the input field-of-view as illustrated in Supplementary Fig. 5. Specifically, we conducted two different types of tests. In the first one, we created input objects with N = 3 , N = 4 and N = 5 spatially-overlapping phase-encoded handwritten digits from different classes and quantified the accuracy of the diffractive optical networks in determining the classes of the input objects correctly based on arg max n , where n = 2, 3, .., N . As shown in Supplementary Fig. 5, with N = 5 phase objects spatially-overlapping within the same input field-of-view, D 2 NN-D1d can still correctly classify at least two of the overlapping objects, i.e., n = 2 , with an accuracy of 84.06% despite the severe contamination of the   Fig. 1b- Fig. 5a). This performance degradation is mainly because the diffractive optical network has never seen more than two objects overlapping at the input field-of-view during its training phase. To further explore the impact of the number of overlapping phase objects used during the training phase on the blind testing accuracy, we extended the detector configuration depicted in Fig. 1c, D-1d, to accommodate 2 × 5 M = 100 single-pixel detectors and trained from scratch a new diffractive optical network that was tasked to classify 5 overlapping handwritten digits. Compared to the classification performance of the D 2 NN-D1d shown in Fig. 3 with N = 5 spatially-overlapping phase objects, this new diffractive network achieved much higher classification accuracies, reaching 99.66%, 93.37%, 55.63% and 12.62% for n = 2 , n = 3 , n = 4 and n = N = 5 , respectively (see Supplementary Fig. 6).
In the second testing scheme, we created input phase objects with N = 4, 6, 8, 10 and 12 spatially-overlapping digits from the same data class and quantified the performance of the diffractive networks in understanding that the objects inside the field-of-view represent only one data class/digit. The classification accuracies attained by D 2 NN-D1d and D 2 NN-D2d, in this case, stay above > 92% up to 4 overlapping samples from the same data class, with the former and the latter diffractive optical network achieving 92.81% and 92.88%, respectively. On the other hand, if more than 5 objects are overlapping within the input field-of-view, the classification accuracies of both of the diffractive models decrease below 80%, hinting at a significant spatial information loss caused by the superposition of a large number of input phase objects (see Supplementary Figs. 5b and 5d).
Next, to test the generalization of the presented diffractive optical framework over different datasets, we trained two new separate diffractive optical networks (each with 5 diffractive layers) that were tasked with the classification of spatially-overlapping, phase-encoded objects from a more challenging image dataset, i.e., Fashion-MNIST. In these two diffractive optical networks, the detector designs were identical to D1d and D2d shown in Fig. 1c,e, respectively. While the D 2 NN-D1d shown in Fig. 3 can achieve 85.82% accuracy for the classification of overlapping handwritten digits, its equivalent (see Supplementary Fig. 7) that is trained and tested on Fashion-MNIST dataset can attain 73.28% bling testing accuracy for the classification of overlapping, phase-encoded fashion products. The same comparison also reveals a similar classification accuracy for the D 2 NN-D2d model, which can classify the phase-encoded, spatially-overlapping fashion products within the input field-of-view with a classification accuracy of 72.21% (see Supplementary Fig. 8). In addition, we combined these two diffractive optical networks classifying overlapping fashion products with shallow, electronic image reconstruction networks, each having only 2 hidden layers. The structural quality of the images reconstructed by these subsequent shallow electronic networks using only the all-optical classification signals synthesized by the corresponding diffractive optical networks is still very good as shown in Supplementary Fig. 9. In summary, to the best of our knowledge, this manuscript reports the first all-optical multi-object classification designs based on diffractive optical networks demonstrating their potential in solving challenging classification and computational imaging tasks in a resource-efficient manner using only a handful detectors at the output plane. In the context of optical classification and reconstruction of overlapping phase objects, also resolving the phase ambiguity due to the spatial overlap of input images, this study exclusively focuses on a setting where the thin input objects are only modulating the phase of the incoming waves, and absorption is negligible. Without loss of generality, the presented diffractive design schemes with the associated loss functions and training methods can also be extended to applications, where the input objects partially absorb the incoming light.

Methods
Optical forward model of diffractive optical networks. D 2 NN framework formulates a given machine learning e.g., object classification or inverse design task as an optical function approximation problem and parameterizes that function over the physical features of the materials inside a diffractive black-box. As is the case in this study, this optical black-box is often modeled through a series of thin modulation layers connected by the diffraction of light waves. Here, we focused our efforts on 5-layer diffractive optical networks as shown in Fig. 1a, each occupying an area of 106 × 106 on the lateral space with denoting the wavelength of the illumination light. The modulation function of each diffractive layer was sampled and represented over a 2D regular grid with a period of 0.53 resulting in N = 200 × 200 different transmittance coefficients, i.e., the diffractive 'neurons' . Based on the 0.53 diffractive feature size, we set the layer-to-layer axial distance to be 40 to ensure connectivity between all the neurons on two successive layers.
Following the same framework used in earlier demonstrations of diffractive optical networks with 3D printed layers 4,6,8 , we selected the diffractive layer thickness, h , as a trainable physical parameter dictating the transmittance of each neuron together with the refractive index of the material. To limit the material thickness range Table 2. The comparison of the presented diffractive optical networks, in terms of (1) all-optical overlapping object classification accuracies on T 2 and (2) the quality of the image reconstruction achieved through separately-trained, shallow, electronic networks (decoder). The mean SSIM and PSNR values and their standard deviations were computed over the entire 10 K blind test inputs (T 2 ).  where the function, Q * (.) represents the *-bit quantization operator and h m is the maximum allowed material thickness. If the material thickness over the i th diffractive neuron located at x i , y i , z i is denoted by h x i , y i , z i , then the resulting transmittance coefficient of that neuron, t x i , y i , z i , is given by, where n( ) and κ( ) are the real and imaginary parts of complex-valued refractive index of the diffractive material at wavelength, . Following the earlier experimental demonstrations of diffractive optical networks, in this work we set the n( ) and κ( ) values to be 1.7227 and 0.031, respectively 8 . The parameter n s in Eq. (5) refers to the refractive index of the medium, surrounding the diffractive layers; without loss of generality, we assumed n s = 1 (air). Based on the outlined material properties, the h m and h b in Eq. (4) were selected as 2λ and 0.66λ, respectively, ensuring that the phase modulation term in Eq. (5), (n( ) − n s ) 2πh(x i ,y i ,z i ) , can cover the entire [0-2π] phase modulation range per diffractive feature/neuron. In this work, the light propagation between the diffractive layers was modeled using the Rayleigh-Sommerfeld diffraction integral, which assumes that the propagating light can be expressed as a scalar field. The exact modeling and computation of the light field diffracted by subwavelength features, in general, require the use of vector diffraction theory 27,28 . On the other hand, since we assume that (1) the diffractive layers are axially separated from each other by several tens of wavelengths, without carrying forward any evanescent fields, and (2) the smallest feature size on a diffractive layer is approximately half a wavelength, the use of scalar fields to represent the spatial information flow within a diffractive optical network is an acceptable approximation, just like employed in the simulation/modeling of diffraction-limited holographic imaging or display systems. In fact, various experimental demonstrations of 3D-fabricated diffractive optical networks designed using the same scalar field theory were reported in the literature 1,4,6,8,12,29 , confirming the validity of this assumption. According to the Rayleigh-Sommerfeld theory of diffraction, a neuron located at x i , y i , z i can be viewed as the source of a secondary wave, where r denotes the radial distance (x − x i ) 2 + y − y i 2 + (z − z i ) 2 . Based on this, the output wave emanating from the ith neuron on the kth layer, u k i x, y, z can be written as,

Optical classification Image reconstruction
The term N q=1 u k−1 q x i , y i , z i in Eq. (9) represents the wave incident on the i th neuron on the k th layer, generated by the neurons on the previous, (k − 1) th diffractive layer.
In this study we also assumed that the transmittance function inside the input field-of-view, T in x, y , covers an area of 53 × 53 and without loss of generality, it is illuminated with a uniform plane wave. At the output plane, the width of each single-pixel detector was set to be 6.36 on both x and y directions for all four output detector configurations shown in Fig. 1b-d. Based on the outlined optical forward model, the diffractive optical networks process the incoming waves generated by the complex-valued transmittance function, T in x, y , formed by the overlapping thin phase objects and synthesize a 2D intensity distribution at the output plane for all-optical inference of the classes of the overlapping objects. The optical intensity distribution within the active area of each output detector is integrated to form elements of the vector, I in Eq. (1). The number of elements in this optical signal vector, I , is equal to the number of output detectors, thus its length is 2M , 4M, M + 2 and 2M + 2 for D 2 NN-D1, D 2 NN-D1d, D 2 NN-D2 and D 2 NN-D2d, respectively. As part of our forward training model, I is normalized to form, I, where the coefficient c in Eq. (8) serves as the temperature parameter of the softmax function depicted in Eq. (3), and it was empirically set to be 0.1 for training of all the diffractive optical networks. It is important to note that this normalization step in Eq. (8) is only used during the training stage, and once the training is finished, the forward inference directly uses the detected intensities to decide on the object classes based on the corresponding decision rules. While the vector I is directly used in Eq. (3) for the D 2 NN-D1 network, in the case of D 2 NN-D1d, I is split into two vectors of length 2M , i.e., I + and I − , representing the signals collected by the positive and negative detectors, and the associated differential signal is computed as I = I + − I − . Accordingly, during the training of D 2 NN-D1d, the loss function depicted in Eq. (3), were computed using I instead of I.  www.nature.com/scientificreports/ For the diffractive optical network D 2 NN-D2, the output of the normalization defined in Eq. (8) was split into two: I M and I Q . The first part, I M , represents the optical signals coming from the M class specific detectors in the detector layout D-2 (the gray detectors Fig. 1d). The latter, I Q , contains two entries describing the positive and the negative parts of the indicator signals, I D + Q and I D − Q (see the blue detectors Fig. 1d). These two extra detectors, D + Q , D − Q , form a differential pair that controls the functional form of the class decision rule based on the sign of the difference between the optical signals collected by these detectors. We accordingly determine the class assignments as follows, To enable the training of diffractive optical networks according to the class assignment rule in Eq. (9), we defined a loss function, L = L Q + L c , that corresponds to the superposition of two different penalty terms, L Q and L c . Here, L Q represents the error computed with respect to the binary ground truth indicator signal, g Q , Accordingly, L Q was defined as, where σ (·) denotes the sigmoid function. The classification loss, L c , on the other hand, is identical to the crossentropy loss depicted in Eq. (3), except that the vector I is replaced with I M . Unlike the previous diffractive optical networks (D 2 NN-D1 and D 2 NN-D1d), the forward model of the diffractive optical networks trained based on the output detector layout D-2 do not require multiple ground truth vector labels. Simply the ground truth label vector of a given input field-of-view is defined as g = Testing of diffractive optical networks on the dataset T 1 . During the blind testing of the presented diffractive optical networks on the test set T 1 , the class estimation solely uses the arg max operation, searching for the highest class-score synthesized by the diffractive optical networks, based on the associated output plane detector layouts shown in Fig. 1. The purpose of this performance quantification using T 1 is to reveal whether the diffractive optical networks trained based on overlapping input phase objects can learn and automatically recognize the characteristic spatial features of the individual handwritten digits (without any spatial overlap). For this goal, in the case of D 2 NN-D1 and D 2 NN-D1d, the class estimation rule in Eq. (1) was replaced with, mod arg max (I), M and mod arg max (�I), M , respectively. Since the input images in the test set T 1 contain single, phase-encoded handwritten digits without the second overlapping phase object, the optical signals collected by the detectors, D + Q , D − Q , at the output plane of the diffractive optical networks D 2 NN-D2 and D 2 NN-D2d become irrelevant for the classification of the images in T 1 . Therefore, the decision rule in Eq. (9) is simplified to arg max (I M ) and arg max (�I M ) for the all-optical classification of the input test images in T 1 based on the diffractive optical networks D 2 NN-D2 and D 2 NN-D2d, respectively.
Architecture and training of the phase image reconstruction network. The phase image reconstruction electronic neural networks (back-end) following each of the presented diffractive optical networks (front-end) include 4 neural layers. The number of neurons on their first layer is equal to the number of detectors at the output plane of the preceding diffractive optical network (D-1, D-1d, D-2 or D-2d, see Fig. 1a-d). The number of neurons, on the subsequent 3 layers are 100, 400, 1568, respectively. Note that the output layer of each image reconstruction electronic neural network has 2 × 28 × 28 neurons as it simultaneously reconstructs both of the overlapping phase objects, resolving the phase ambiguity due to the spatial overlap at the input plane. Each fully-connected (FC) layer constituting these image reconstruction networks applies the following operations: where ρ l+1 and ρ l denote the output and input values of the l th layer of the electronic network, respectively. The operator LReLU stands for the leaky rectified linear unit: The batch normalization, BN, normalizes the activations at the output of LReLU to zero mean and a standard deviation of 1, and then shifts the mean to a new center, β (l) , and re-scales it with a multiplicative factor, γ (l) , where β (l) and γ (l) are learnable parameters, i.e., (10) The hyperparameter ϕ in Eq. (15) is a threshold for the transition between mean-absolute-error and meansquared-error, and it was set to be 20% of the standard deviation of the ground truth image.
If we let φ p x, y and φ q x, y denote the first and second output of each image reconstruction electronic network (i.e., 28 × 28 pixels per phase object), the image reconstruction loss, L r , was defined as the minimum of two error terms, L ′ r and L ′′ r , i.e., where φ 1 x, y and φ 2 x, y denote the ground truth phase images of the first and second objects, respectively, which overlap at the input plane of the diffractive optical network. As there is no hierarchy or priority difference between the input objects φ 1 x, y and φ 2 x, y , Eq. (16) lets the image reconstruction network to choose their order regarding its output activations.
Other details of diffractive optical network training. With the 0.53 lateral sampling rate in our forward optical model, the transmittance function inside the field-of-view, T in x, y , was represented as a 100 × 100 discrete signal. In our diffractive optical network training, the 8-bit grayscale values of the MNIST digits were first converted to 32-bit double format, normalized to the range [0,1] and then resized to 100 × 100 using bilinear interpolation. If we denote these normalized and resized grayscale values of the two input objects/digits that overlap at the input plane as θ 1 x, y and θ 2 x, y then the transmittance function within the input field-of-view, T in x, y , is defined as, During the training of the presented diffractive optical networks, θ 1 x, y and θ 2 x, y are randomly selected from the standard 55 K training samples of MNIST dataset without replacement, meaning that, an already selected training digit was not selected again until all 55 K samples are depleted constituting an epoch of the training phase. In this manner, we trained the diffractive optical networks for 20,000 epochs, showing each optical network approximately 550 million different T in x, y during the training phase. Supplementary Fig. 10 illustrates the convergence curves of our best performing diffractive optical networks models (D 2 NN-D1d and D 2 NN-D2d) over the course of these 20,000 epochs. Similarly, to generate the input fields in test dataset T 2 , we randomly selected θ 1 x, y and θ 2 x, y among the standard 10 K test samples of MNIST dataset, without replacement, and this was repeated two times providing us the 10 K unique phase input images of overlapping handwritten digits constituting T 2 . In our T 2 test set, 8998 inputs contain overlapping digits from different data classes, while the remaining 1002 inputs have overlapping samples from the same data class/digit. The validation image set (Supplementary Fig. 10), on the other hand, contains 5 K unique phase input images created by randomly selecting θ 1 x, y and θ 2 x, y among the standard 10 K test samples of MNIST dataset without replacement. Note that the θ 1 x, y and θ 2 x, y combinations used in the validation set of Supplementary Fig. 10 are not included in T 2 to achieve true blind testing without any data contamination.
For the digital implementation of the diffractive optical network training outlined above, we developed a custom-written code in Python (v3.6.5) and TensorFlow (v1.15.0, Google Inc.). The backpropagation updates were calculated using the Adam 30 optimizer with its parameters set to be the default values as defined by Tensor-Flow and kept identical in each model. The learning rate was set to be 0.001 for all the diffractive optical network models presented here. The training batch size was taken as 75 during the deep learning-based training of the presented diffractive optical networks. The training of a 5-layer diffractive optical network with 40 K diffractive neurons per layer for 20,000 epochs takes approximately 24 days using a computer with a GeForce GTX 1080 Ti Graphical Processing Unit (GPU, Nvidia Inc.) and Intel® Core ™ i7-8700 Central Processing Unit (CPU, Intel Inc.) with 64 GB of RAM, running Windows 10 operating system (Microsoft).